-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add limit to ArrowReaderBuilder to push limit down to parquet reader #3633
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some minor nits, for consistency I think we should apply this to the async reader also prior to merge, should be a simple copy-paste
I also wonder if we should allow specifying an offset, in addition to a limit 🤔
Self { | ||
selectors: vec![ | ||
RowSelector::select(limit), | ||
RowSelector::skip(total_rows.saturating_sub(limit)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowSelector::skip(total_rows.saturating_sub(limit)), |
This shouldn't be necessary, it will get removed by RowSelection::trim
anyway. Does make me wonder if this method is even really needed 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I think the main thing it does is just to truncate a final select
if needed, but it can be a lot simpler if we do not need to append a skip
on the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry wrong method, yeah I don't think it makes sense then
@@ -371,6 +382,35 @@ impl RowSelection { | |||
self | |||
} | |||
|
|||
/// Limit this [`RowSelection`] to only select `limit` rows | |||
pub(crate) fn limit(mut self, mut limit: usize) -> Self { | |||
let mut remaining = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be more simply implemented as
Self {
selectors: intersect_row_selections(&self.selectors, &[RowSelector::select(limit)]),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, RowSelection
row count allow to less than file total row count 🤔.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intersect_row_selections
is a bit weird. It doesn't seem to coalesce selections. Not sure if that is intentional or not.
@@ -453,6 +467,17 @@ impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>> { | |||
selection = Some(RowSelection::from(vec![])); | |||
} | |||
|
|||
// If a limit is defined, apply it to the final `RowSelection` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to also do something similar for the async reader
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@@ -430,6 +432,17 @@ where | |||
return Ok((self, None)); | |||
} | |||
|
|||
// If a limit is defined, apply it to the final `RowSelection` | |||
if let Some(limit) = limit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will apply the limit to each row group separately, instead of the entire file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. Fixed this to track the limit across row groups.
Thank you |
Benchmark runs are scheduled for baseline = 3057fa5 and contender = a76ea1c. a76ea1c is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3631
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?